\(\rightarrow\) Definition of the project

What have we learned so far?

In our study of financial time series, we concluded that returns and standardized residuals behave like white noise: it is difficult to predict future portfolio returns from past values. This finding unsettles students who are used to building portfolios based on Markowitz theory, and it forces us to review our entire allocation strategy.

  • Can we build a model to predict a stock’s return from its company’s attributes (market capitalization, price-to-book ratio, 1-month volatility, …)?
  • Can we predict which companies will outperform the crowd?
  • Can we integrate machine learning algorithms into a trading system?
  • What are the performance metrics of these strategies compared to a standard benchmark, an equally weighted portfolio?

The aim of this project is to answer all these questions by creating and managing 4 portfolios with different allocation strategies: the standard tangent portfolio of Harry Markowitz and regression-based portfolios, using both machine learning and a simple regression model.

Welcome to our portfolio building strategy :).




1 Data preprocessing

The dataset consists of monthly financial information from January 31st, 1995 to February 26th, 2021 pertaining to 928 stocks around the globe. Each stock is characterized by its:

  • Closing price (Close): the company’s share price on the last trading day of the month
  • Past 1-month volatility (Vol_1M): a measure of dispersion of the company’s returns, extracted from the daily/weekly returns of the last month (expressed as a percentage)
  • Market capitalization in M$ (Mkt_Cap): the market value of a publicly traded company’s outstanding shares
  • Price-to-book ratio (P2B): a ratio used to compare a company’s current market value to its book value
  • Debt-to-equity (D2E): a ratio indicating the relative proportion of shareholders’ equity and debt used to finance a company’s assets

1.1 Missing data & imputation

The first step of the study is to preprocess the data. This step is crucial in portfolio building with machine learning: we need to check whether the dataset contains missing points and handle them in an efficient way.

if(!require(naniar)){install.packages(c("naniar", "zoo", "rpart", "rpart.plot", "randomForest",
                                        "xgboost", "plotly", "PerformanceAnalytics", "fpp2",
                                        "IntroCompFinR", "e1071"))}   # Packages must be passed as a vector
library(tidyverse)                          # Loading the package/environment
library(dplyr)                              # Grammar of data manipulation
library(naniar)                             # Missing data exploration
library(zoo)                                # Great package for time-series
library(rpart)                              # Simple decision tree package
library(rpart.plot)                         # Tree plot package
library(randomForest)                       # randomForest regression
library(xgboost)                            # Boosted trees regression 
library(plotly)                             # Interactive graphs
library(PerformanceAnalytics)               # To assess portfolio performance
library(fpp2)                               # moving average
library(IntroCompFinR)                      # MSR portfolio
library(e1071)                              # Statistic metrics kth central moment

load("data_full.RData")                     # Load the dataset
data_full                                   # Show the data
data_full %>% summary                       # Summary of the data
##      Tick                Date                Close               Vol_1M       
##  Length:291392      Min.   :1995-01-31   Min.   :     0.00   Min.   :   0.00  
##  Class :character   1st Qu.:2001-07-31   1st Qu.:     7.76   1st Qu.:  21.62  
##  Mode  :character   Median :2008-02-14   Median :    16.65   Median :  30.88  
##                     Mean   :2008-02-13   Mean   :    64.74   Mean   :  40.23  
##                     3rd Qu.:2014-08-29   3rd Qu.:    33.65   3rd Qu.:  46.31  
##                     Max.   :2021-02-26   Max.   :180468.76   Max.   :5052.37  
##                                                              NA's   :3416     
##     Mkt_Cap               P2B                D2E           
##  Min.   :      0.0   Min.   :0.00e+00   Min.   :      0.0  
##  1st Qu.:    356.6   1st Qu.:9.80e-01   1st Qu.:     16.9  
##  Median :   1486.0   Median :1.59e+00   Median :     51.4  
##  Mean   :  12405.4   Mean   :5.46e+00   Mean   :    220.9  
##  3rd Qu.:   6957.6   3rd Qu.:2.81e+00   3rd Qu.:    105.7  
##  Max.   :2255969.1   Max.   :1.24e+05   Max.   :1188136.9  
##  NA's   :5168        NA's   :22664      NA's   :12371
The summary table of the dataset shows that there are missing points in the Vol_1M, Mkt_Cap, P2B and D2E columns of the dataframe. Which variables are most affected, and in what proportion?

gg_miss_var(data_full)                      # Proportion of missing points per column (NA)

The two most affected columns are indeed P2B (price-to-book) and D2E (debt-to-equity). To avoid a concentration of data at some points after the scaling process, we drop the ticks for which no information is provided (the entire column of a feature is empty). This results in a loss of 5% of the data, which is acceptable. We still have to fill the remaining gaps in the dataframe for the backtesting. To do this, we use two techniques:

  • time-series imputation for the Mkt_Cap and Vol_1M columns: assuming that if a company has not published the data for a specific month, we can be conservative and carry the last preceding value forward.
  • cross-sectional imputation for P2B and D2E: here, we assign the cross-sectional median (more robust than the mean because it is less affected by outliers), as if the missing value for the company were equal to a representative value.
data_full <- data_full %>% 
  group_by(Tick) %>%                                    # Perform the operation for each stock
  mutate(Mkt_Cap = na.locf(Mkt_Cap, na.rm = F),         # Replace the feature by previous values
         Vol_1M = na.locf(Vol_1M, na.rm = F),
         P2B = na.locf(P2B, na.rm = F),
         D2E = na.locf(D2E, na.rm = F)) %>%
  filter(!all(is.na(P2B)), !all(is.na(Mkt_Cap))) %>%
  ungroup() %>%                                         # Ungroup
  group_by(Date) %>%                                    # Perform the operation for EACH date
  mutate(P2B = if_else(is.na(P2B), median(P2B, na.rm = T), P2B),
         D2E = if_else(is.na(D2E), median(D2E, na.rm = T), D2E),
         Mkt_Cap = if_else(is.na(Mkt_Cap), median(Mkt_Cap, na.rm = T), Mkt_Cap),
         Vol_1M = if_else(is.na(Vol_1M), median(Vol_1M, na.rm = T), Vol_1M)) 
                                                        # Replace the feature value by the median at each date if NA
data_full %>% summary()
##      Tick                Date                Close               Vol_1M       
##  Length:276948      Min.   :1995-01-31   Min.   :     0.00   Min.   :   0.00  
##  Class :character   1st Qu.:2001-07-31   1st Qu.:     7.90   1st Qu.:  21.84  
##  Mode  :character   Median :2008-02-14   Median :    16.87   Median :  31.25  
##                     Mean   :2008-02-13   Mean   :    66.98   Mean   :  41.82  
##                     3rd Qu.:2014-08-29   3rd Qu.:    34.17   3rd Qu.:  47.05  
##                     Max.   :2021-02-26   Max.   :180468.76   Max.   :5052.37  
##     Mkt_Cap               P2B                D2E           
##  Min.   :      0.0   Min.   :0.00e+00   Min.   :      0.0  
##  1st Qu.:    351.4   1st Qu.:9.90e-01   1st Qu.:     16.8  
##  Median :   1378.7   Median :1.59e+00   Median :     51.1  
##  Mean   :  11239.3   Mean   :6.74e+00   Mean   :    300.0  
##  3rd Qu.:   6047.9   3rd Qu.:2.84e+00   3rd Qu.:    106.4  
##  Max.   :2255969.1   Max.   :1.24e+05   Max.   :1188136.9

1.2 Feature engineering

The descriptive statistics of the features highlight a wide gap between the median and both the minimum and the maximum (presence of outliers?). For instance, the maximum P2B is roughly \(8\times10^4\) times the median: a very large number. The same observation holds for D2E. Are these values errors, or are they due to very small denominators (a low equity compared to debt, or a low book value compared to market value)? There are also huge discrepancies in scale between the features: Mkt_Cap is of order \(10^6\), while Vol_1M, D2E and P2B are of order \(10^2\).

So, we have to rescale the predictors (features). We choose to work with positive values, but this is without much loss of generality. We scale the 4 variables with the ‘uniformisation’ method: \(\tilde{x}_i=\mathrm{ecdf}(x_i)\), where ecdf is the empirical cumulative distribution function of the sample.

normalise <- function(v){                              # This is a function that 'uniformises' a vector
  v <- v %>% as.matrix()
  return(ecdf(v)(v))
}
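As a quick sanity check (a base-R rewrite of normalise, so the snippet is self-contained), the ecdf mapping turns any sample into evenly spaced ranks in \((0,1]\), regardless of the original scale:

```r
normalise_base <- function(v) ecdf(v)(v)   # Base-R equivalent of normalise()

x <- c(5, 50, 500, 5000)                   # Wildly different magnitudes
normalise_base(x)                          # -> 0.25 0.50 0.75 1.00
normalise_base(1000 * x)                   # Same output: the transform is scale-invariant
```

This is why the transform removes the \(10^6\) vs \(10^2\) scale discrepancy between Mkt_Cap and the other features.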

data_full <- data_full %>%
  group_by(Tick) %>%
  mutate(F_return = lead(Close) / Close - 1,           # Compute forward returns
         P_return = lag(F_return)) %>%                 # Compute past returns
  na.omit() %>%                                        # Take out missing data
  ungroup() %>%
  group_by(Date) %>%
  mutate(FR_return = F_return - mean(F_return)) %>%    # Relative return vs EW market
  ungroup()
 
data_u <- data_full %>%                                # Data with normalized values
  group_by(Date) %>%                                   # Normalization takes place for each date
  mutate_at(vars(Vol_1M:D2E), normalise) %>%           # Apply normalization on features
  ungroup() 

data_u %>% 
  select(Vol_1M, Mkt_Cap, P2B, D2E) %>%                # select necessary columns
  pivot_longer(cols = everything(),                    # Convert to 'compact' mode
                 names_to = 'Attribute', values_to = 'Value') %>%  
  ggplot(aes(x = Value, fill = Attribute)) + 
  geom_histogram() + theme_light() +                   # Plot histograms
  facet_grid(Attribute~.)                              # Stack the histograms

The distribution of the variables is strongly affected by the scaling choice: we observe a peak in D2E (due to outliers?) which will have some impact on the regression process. Nevertheless, we keep this feature, because its particular behavior may explain some patterns observed in the forecasting process. For the rest of the features, the scaling looks fine. With no missing values and scaled data, we can start our portfolio construction process.

2 Machine learning & Portfolio allocation

By definition, a portfolio is a choice of weights that sum to one. To assess realistic strategies, we implement five classical portfolios, based on different strategies commonly encountered in hedge funds. The weighting schemes follow signals driven either by past returns or by company features. Our strategies are the following:

  • Equally weighted portfolio: the uniform portfolio, which serves as benchmark. The EW weights are \(w=\frac{1}{N}\).

  • Maximum Sharpe ratio portfolio: the weighting scheme follows Markowitz theory. It consists of a well-balanced maximization of the risk premium while reducing the overall volatility, hence leading to the highest Sharpe ratio. It is rebalanced every month according to the new financial situation and perspectives of the companies, and requires as inputs the (expected) mean returns \(\mu\) and covariance matrix \(\Sigma\) of the assets since 1995. The MSR weights are \(w=\frac{\Sigma^{-1}\mu}{\mathbf{1}'\Sigma^{-1}\mu}\). Both \(\mu\) and \(\mathbf{1}\) are vectors here.

  • Regression-based portfolio: we regress the assets’ forward returns on past data, forecast the future returns, and invest equally in the quarter of assets whose forecasts lie above the cross-sectional 75% quantile. We take into account relative performance and not raw performance (the threshold is a cross-sectional quantile, not just zero): \(w=\frac{1}{N}\) for \(\hat{Y} \ge q_{75\%}(\hat{Y})\).

  • XGBoosted trees: eXtreme Gradient Boosting (XGBoost) is an aggregation of trees, where each new tree is constructed in an optimal way so as to minimize the training error. Just as with the regression method, we use relative performance: \(w=\frac{1}{N}\) for \(\hat{Y} \ge q_{75\%}(\hat{Y})\).

  • Random forests: compilations of decision trees with randomly chosen predictors. In our sample, we have 4 predictors (Mkt_Cap, P2B, Vol_1M, D2E). In the allocation process, we randomly combine subsets of these features and grow many trees. We use the same proportion as for the XGBoosted trees and the regression: \(w=\frac{1}{N}\) for \(\hat{Y} \ge q_{75\%}(\hat{Y})\).
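As a minimal illustration of the MSR formula above (with made-up \(\mu\) and \(\Sigma\) for three assets, not the project data), the weights are proportional to \(\Sigma^{-1}\mu\) and rescaled to sum to one:

```r
mu    <- c(0.05, 0.08, 0.12)                    # Toy expected returns (3 assets)
Sigma <- matrix(c(0.04, 0.01, 0.00,
                  0.01, 0.09, 0.02,
                  0.00, 0.02, 0.16), nrow = 3)  # Toy covariance matrix
w_raw <- solve(Sigma, mu)                       # Sigma^{-1} mu
w_msr <- w_raw / sum(w_raw)                     # Normalize so the weights sum to one
sum(w_msr)                                      # -> 1
```

Note that, unlike the \(1/N\) schemes, nothing prevents MSR weights from being negative (short positions).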

2.1 XGBoosted trees

Which feature guarantees a good split? To answer this question, we build and plot two decision trees and compare them: one on the original dataset, the other on the normalized data.

features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E")            # Predictors
sep_date <- as.Date("2015-12-31")               # Train vs test separation date
train_sample <- data_full %>%                   # Train sample
  filter(Date <= sep_date)
test_sample <- data_full %>%                    # Test sample
  filter(Date > sep_date) %>%
  na.omit() 

fit <- train_sample %>%                         # Original (non-normalized) data
  select(features, FR_return) %>%               # Keep the predictors and the label
  rpart(FR_return ~ .,                          # Model: FR_return as a function of all other variables
        cp = 0.00001,                           # Complexity: lower means more nodes/levels
        maxdepth = 7,                           # Maximum depth (nb levels) of the tree
        data = .,                               # Data source: from the pipe
        model = TRUE)                              
rpart.plot(fit)                                 # Plot

The bold text details the splitting variables and split points, displayed just below the clusters they divide in two. The first split is made on the market capitalization (Mkt_Cap) variable at level 0.0039. Instances with Mkt_Cap larger than or equal to 0.0039 go to the left cluster (almost all the dataset) and those with Mkt_Cap strictly smaller than 0.0039 go to the right cluster (less than 1% of the dataset).

With a dataset of 881 stocks, to achieve a certain level of splitting we need to impose a small complexity parameter (\(10^{-5}\)) with a maximum depth of 7 layers. Nevertheless, this design still produces one big cluster holding 95% of the sample, with an average forward relative return of -0.0055. The remaining splits capture micro-phenomena related to the rebound of companies after a large drop in share price (Synthetic Biologics Inc (SYN) in 2006, Cytocore (MDIT) in 1998), leading to average forward relative returns from 7.3 to 27, which will not be meaningful in terms of out-of-sample generalization.

pred <- predict(fit, test_sample)           # Predictions
mean(pred)-mean(test_sample$FR_return)      # Bias
## [1] 0.007868397
var(pred)  
## [1] 0.08640867

The first number, the bias, shows the error of the average prediction, while the second, the variance, measures the dispersion of the predictions. There are only 14 distinct predicted values (one per terminal leaf) in this set, which is a perfect illustration of the bias-variance trade-off: the bias is small (we stick to the data), but the variance is huge.

Below, we perform the same analysis on the formatted/normalized data and observe the difference. We also remove two layers from the tree, since they are not significant.

train_sample <- data_u %>%                     # Train sample
  filter(Date <= sep_date)
test_sample <- data_u %>%                      # Test sample
  filter(Date > sep_date) %>%
  na.omit()

fit <- train_sample %>%                        # Normalized data
  select(features, FR_return) %>%              # Keep the predictors and the label
  rpart(FR_return ~ .,                         # Model: FR_return as a function of all other variables
        cp = 0.00001,                          # Complexity: lower means more nodes/levels
        maxdepth = 5,                          # Maximum depth (nb levels) of the tree
        data = .,                              # Data source: from the pipe
        model = TRUE)                              
rpart.plot(fit)                                # Plot

With 5 layers, we end up with a more efficient split than on the non-normalized dataset. Nevertheless, the big cluster still contains more than 95% of the dataset.

pred <- predict(fit, test_sample)           # Predictions
mean(pred)-mean(test_sample$FR_return)      # Bias
## [1] -9.506742e-05
var(pred)  
## [1] 0.01329928

The bias-variance trade-off persists on this dataset. The magnitude of the bias is divided by 82.77 in absolute value (from 0.79% to -0.95 basis point) while the variance is divided by 6.50 (from 8.64% to 1.33%). To improve the predictions, we have a look at the impact of the hyperparameters on model performance and select the best ones.
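The improvement factors quoted above can be recomputed directly from the bias and variance values reported by the two chunks:

```r
bias_raw <- 0.007868397;  bias_norm <- -9.506742e-05   # Biases (raw vs normalized data)
var_raw  <- 0.08640867;   var_norm  <- 0.01329928      # Variances (raw vs normalized data)
abs(bias_raw / bias_norm)                              # -> ~82.77
var_raw / var_norm                                     # -> ~6.50
```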

2.1.1 Tuning of hyperparameters

\(\rightarrow\)Grid search

We start by preparing the inputs. Because we work with the simple (scaled) features, the dependent variable is the 1-month return, since the portfolio is rebalanced monthly. Additionally, we in fact use the return relative to the average performance (the FR_return variable), to select the assets above the EW benchmark.

We recall some key hyperparameters of boosted trees:

  • eta: the learning rate, a scaling factor applied to each new tree added to the model
  • lambda: the weight regularizer; it penalizes the objective function through the total sum of squared leaf weights/scores
  • nrounds: the number of trees in the model
  • max_depth: the number of layers in each tree; we stick to 4 for simplicity
  • gamma: the complexity penalization, i.e., the minimum loss reduction required to make a further partition on a leaf node of the tree
train_features <- train_sample %>% select(features) %>% as.matrix()      # Independent variables
train_label <- train_sample$FR_return                                    # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label)  # XGB format
test_features <- test_sample %>% select(features) %>% as.matrix()        # Independent variables
test_label <- test_sample$FR_return                                      # Dependent variable

eta <- c(0.2, 0.4, 0.6, 0.8, 0.9)         # eta values
nrounds <- c(10, 25, 50, 75, 90)          # nb of trees 
lambda <- c(0.01, 0.1, 1, 10, 100)        # lambda values
pars <- expand.grid(eta, nrounds, lambda) # Exploring all combinations!
eta <- pars[,1]
nrounds <- pars[,2]
lambda <- pars[,3]

Below, we define the function that we will use for the grid search and test the triplet of parameters values.

grid_par <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
  fit <- train_matrix %>% 
    xgb.train(data = .,                       # Data source (pipe input)
              eta = eta,                      # Learning rate
              objective = "reg:squarederror", # Objective function
              max_depth = 4,                  # Maximum depth of trees
              lambda = lambda,                # Penalization of leaf values
              gamma = 0,                      # Penalization of number of leaves
              nrounds = nrounds,              # Number of trees used
              verbose = 0                     # No comment
    )
  
  pred <- predict(fit, test_features)         # Predictions based on fitted model & test values
  return(mean((pred-test_label)^2))           # Mean squared error
}

grd <- pmap(list(eta, nrounds, lambda),               # Parameters for the grid search
            grid_par,                                 # Function on which to apply the grid search
            train_matrix = train_matrix,              # Input for function: training data
            test_features = test_features,            # Input for function: test features
            test_label = test_label                   # Input for function: test labels (returns)
)

grd <- data.frame(eta, nrounds, lambda, error = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta)                                    # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = eta, y = error, fill = eta)) +        # Plot!
  geom_col() + theme_light() + 
  facet_grid(rows = vars(nrounds), cols = vars(lambda)) +
  geom_hline(yintercept = var(test_label), linetype = 2) 

var(test_label)                                              # Benchmark = zero prediction
## [1] 0.0183818

If we predicted a zero return every time, we would make a quadratic error of 1.84%. From the facet grid, a pattern emerges: lambda (from left to right) and eta (the x-axis) seem to be the main drivers of the error. With lambda = 10-100, eta = 0.4-0.6 and nrounds = 10-25, the prediction errors are very close to the benchmark. We can learn patterns with those specific values.

\(\rightarrow\)Separating loss from performance

Since we rebalance our portfolio every month, we need a better indicator: the hit ratio. It is the percentage of times that the prediction points in the right direction, i.e., the proportion of cases where the forecast and the realized relative return have the same sign.
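Concretely, the hit ratio only looks at sign agreement between forecast and realized relative return (toy numbers for illustration):

```r
pred   <- c( 0.010, -0.020,  0.030, -0.005)   # Forecast relative returns
actual <- c( 0.015,  0.008, -0.012, -0.030)   # Realized relative returns
mean(pred * actual > 0)                       # -> 0.5 (the sign matches in 2 cases out of 4)
```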

grid_par_hit <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
  fit <- train_matrix %>% 
    xgb.train(data = .,                       # Data source (pipe input)
              eta = eta,                      # Learning rate
              objective = "reg:squarederror", # Objective function
              max_depth = 4,                  # Maximum depth of trees
              lambda = lambda,                # Penalization of leaf values
              gamma = 0,                      # Penalization of number of leaves
              nrounds = nrounds,              # Number of trees used
              verbose = 0                     # No comment
    )
  
  pred <- predict(fit, test_features)                # Predictions based on fitted model & test values
  return(mean(pred*test_label>0))                    # Hit ratio!
}

test_label_hit <- test_sample$FR_return              # Dependent variable for the hit ratio (its sign is what matters)

grd <- pmap(list(eta, nrounds, lambda),              # Parameters for the grid search
            grid_par_hit,                            # Function of the grid search
            train_matrix = train_matrix,             # Input for function: training data
            test_features = test_features,           # Input for function: test features
            test_label = test_label_hit              # Input for function: test labels (returns)
)
grd <- data.frame(eta, nrounds, lambda, hit = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta)                                  # Parameters as categories (for plotting)
                                                           # Plot!
graph <- grd %>% ggplot(aes(x = eta, y = hit, color = eta)) +       
  geom_point() + theme_light() + 
  ylim(0.45,0.53) +
  geom_hline(yintercept = mean(test_label_hit>0), color = "#646464") + # Benchmark value (horiz line)
  facet_grid(rows = vars(nrounds), cols = vars(lambda))
ggplotly(graph) %>%                                                # Ease the reading
  layout(legend = list(orientation = "v", x = 1.05, y = 1))
mean(test_label_hit>0)    # Benchmark: assuming constant positive return (e.g., EW portfolio!)
## [1] 0.4692019
max(grd$hit)              # Best hit ratio from model
## [1] 0.5251292

All hit ratios are above the benchmark of 46.92%. Indeed, many of them are above 50% (with a maximum at 52.51%), which is a good sign. We obtain the best results with the same parameters as mentioned above: eta = 0.6, lambda = 10, nrounds = 25. However, this performance does not take transaction costs into account; they should be computed to make sure that the performance does not vanish with trading fees.

The hyperparameters of the XGBoosted trees model are now fixed; we can focus on the grid search for the random forest model.

2.2 Random forests

Random forests are more complex and richer than simple trees. They are compilations of decision trees with randomly chosen predictors, and they have a much smaller tendency to overfit since each tree is built on a random subsample. Thus, most of the time, their performance is higher.

2.2.1 Tuning of hyperparameters

\(\rightarrow\)Grid search

We recall some key hyperparameters of Random forests:

  • ntree: the number of random trees.
  • sampsize: the training sample size for each tree.
  • nodesize: the minimum size of terminal nodes. A larger nodesize causes smaller trees to be grown and thus takes less time.
  • mtry: the number of predictive variables tried at each split; we stick to 4 (all the features) given the small number of predictors.

With random forests, we cannot test too many parameter combinations, since the search is time-consuming. For example, if we selected 5 values per parameter as for XGBoost, with a computation time of around 15 s per triplet it would take \(\sim 0.25\times5^3 = 31.25\) minutes (more than half an hour). nodesize is the parameter with the largest impact on computation time.
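The back-of-the-envelope estimate above is simply (assuming roughly 15 s per fit):

```r
n_combos   <- 5^3                 # 5 values for each of the 3 hyperparameters: 125 triplets
mins_total <- n_combos * 15 / 60  # ~15 s per fit, converted to minutes
mins_total                        # -> 31.25
```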

train_data <- train_sample %>% select(features, FR_return)               # Dependent variable
test_features <- test_sample %>% select(features)                        # Independent variables
test_label <- test_sample$FR_return                                      # Dependent variable

nodesize <- c(1, 3, 5, 10)             # nodesize values
sampsize <- c(750, 1500, 3000, 5000)   # sampsize of trees 
ntree <- c(12, 25, 50, 100)            # ntree values
pars <- expand.grid(nodesize, sampsize, ntree) # Exploring all combinations!
nodesize <- pars[,1]
sampsize <- pars[,2]
ntree <- pars[,3]

Below, we define the function that we will use for the grid search and test the triplet of parameters values.

grid_par_rdf <- function(nodesize, sampsize, ntree, train_data, test_features, test_label){ 
  set.seed(42)                                  # Fixing random seed
  fit <- train_data %>%                         # Normalized data
  select(features, FR_return) %>%               # Take out non predictive variables
  randomForest(FR_return~.,                     # Same formula as for simple trees!
               data = .,                        # Data source (pipe input)
               ntree = ntree,                   # Number of random trees
               mtry = 4,                        # Number of predictive variables used for each tree
               replace = FALSE,                 # All different instances
               sampsize = sampsize,             # Training sample size for each tree
               nodesize = nodesize              # Minimum size of terminal nodes
  )
  pred <- predict(fit, test_features)           # Prediction
  return(mean((pred-test_label)^2))             # Mean squared error
}

grd <- pmap(list(nodesize, sampsize, ntree),    # Parameters for the grid search
            grid_par_rdf,                       # Function on which to apply the grid search
            train_data = train_data,            # Input for function: training data
            test_features = test_features,      # Input for function: test features
            test_label = test_label             # Input for function: test labels (returns)
)

grd <- data.frame(nodesize, sampsize, ntree, error = unlist(grd)) # Dataframe with all results
grd$nodesize <- as.factor(nodesize)                               # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = nodesize, y = error, fill = nodesize)) +   # Plot!
  geom_col() + theme_light() + 
  facet_grid(rows = vars(sampsize), cols = vars(ntree)) +
  geom_hline(yintercept = var(test_label), linetype = 2) 

var(test_label)                                                   # Benchmark = zero prediction
## [1] 0.0183818

The quadratic error of 1.84% does not change, since we use the same test sample. From the facet grid a pattern emerges: the most impactful variables are sampsize (from top to bottom) and nodesize (the x-axis). With sampsize = 1500-3000 and nodesize = 3 or 10, the prediction errors are very close to the benchmark. Now, let’s check whether this set has the same impact on the hit ratio.

\(\rightarrow\)Separating loss from performance

We define the analogous function for the random forest setup.

grid_par_hit_rdf <- function(nodesize, sampsize, ntree, train_data, test_features, test_label){
  set.seed(42)                                  # Fixing random seed
  fit <- train_data %>%                         # Normalized data
  select(features, FR_return) %>%               # Take out non predictive variables
  randomForest(FR_return~.,                     # Same formula as for simple trees!
               data = .,                        # Data source (pipe input)
               ntree = ntree,                   # Number of random trees
               mtry = 4,                        # Number of predictive variables used for each tree
               replace = FALSE,                 # All different instances
               sampsize = sampsize,             # Training sample size for each tree
               nodesize = nodesize              # Minimum size of terminal nodes
  )
  pred <- predict(fit, test_features)           # Prediction
  return(mean(pred*test_label>0))               # Hit ratio!
}

grd <- pmap(list(nodesize, sampsize, ntree),    # Parameters for the grid search
            grid_par_hit_rdf,                   # Function of the grid search
            train_data = train_data,            # Input for function: training data
            test_features = test_features,      # Input for function: test features
            test_label = test_label             # Input for function: test labels (returns)
)
grd <- data.frame(nodesize, sampsize, ntree, hit = unlist(grd)) # Dataframe with all results
grd$nodesize <- as.factor(nodesize)                             # Parameters as categories (for plotting)
                                                                # Plot!
graph <- grd %>% ggplot(aes(x = nodesize, y = hit, color = nodesize)) +  
  geom_point() + theme_light() + 
  ylim(0.45,0.53) +
  geom_hline(yintercept = mean(test_label>0), color = "#646464") + # Benchmark value (horiz line)
  facet_grid(rows = vars(sampsize), cols = vars(ntree))
ggplotly(graph) %>%                                                # Ease the reading
  layout(legend = list(orientation = "v", x = 1.05, y = 1))
mean(test_label>0)    # Benchmark: assuming constant positive return (e.g., EW portfolio!)
## [1] 0.4692019
max(grd$hit)          # Best hit ratio from models
## [1] 0.5121743

All hit ratios are above the benchmark of 46.92%, just like for the XGBoosted model. Indeed, most of them are above 50% (with a maximum at 51.22%), which is a good sign. However, this is still not conclusive in practice, since it does not take transaction costs into account. From the results of the graph, we select nodesize = 3, sampsize = 3000 and ntree = 100.

The hyperparameters of the random forest model are now fixed; we can properly design the weighting functions.

2.3 The weighting function

We now turn to the definition of the weighting schemes. We wish to compare all the strategies, tree-based and MSR, to the usual equally weighted (EW) portfolio. We use the xgboost and randomForest algorithms to forecast future returns and form an EW portfolio of the stocks with the highest predictions: 221 companies, or equivalently the top 25% (the sample has 882 stocks).

weights_reg <- function(past_data, curr_data, thresh){                     # Weighting function based on regression
  past_data <- past_data %>% select(features, F_return)                    # Data prior to today
  curr_data <- curr_data %>% select(features)                              # Today data
  fit <- lm(F_return ~ ., 
            data = past_data)                    # Training on past data
        pred <- predict(fit, curr_data)          # Prediction
        w <- pred > quantile(pred, thresh)       # Investment decision: TRUE  = 1
        return(w / sum(w))
}

weights_xgb <- function(train_data, test_data, thresh){                     # Weighting function based on ML process: XGBoosted trees
  train_features <- train_data %>% select(features) %>% as.matrix()         # Indep. variable
  train_label <- train_data$FR_return                                       # Dependent variable
  train_matrix <- xgb.DMatrix(data = train_features, label = train_label)   # XGB format
  fit <- train_matrix %>% 
      xgb.train(data = .,                       # Data source (pipe input)
                eta = 0.6,                      # Learning rate (chosen with tuning model)
                objective = "reg:squarederror", # Loss function (squared error)
                max_depth = 4,                  # Maximum depth of trees (fixed)
                nrounds = 25,                   # Number of trees used (chosen with tuning model)
                lambda = 10,                    # Penalization of leaf values (chosen with tuning model)
                gamma = 0                       # Penalization of number of leaves (fixed)
      )
  xgb_test <- test_data %>%                     # Test sample => XGB format
      select(features) %>% 
      as.matrix() %>%
      xgb.DMatrix()
    
    pred <- predict(fit, xgb_test)              # Single prediction
    w <- pred > quantile(pred, thresh)          # Top 25% largest values, equally-weighted
    return(w / sum(w))
}

weights_rdf <- function(train_data, test_data, thresh){ 
                                                # Weighting function based on ML process: Random Forest
  test_features <- test_data %>%                # Take out non predictive variables
    select(features)                 
  
  set.seed(42)                                  # Fixing random seed
  fit <- train_data %>%                         # Normalized data
  select(features, FR_return) %>%               # Take out non predictive variables
  randomForest(FR_return~.,                     # Same formula as for simple trees!
               data = .,                        # Data source (pipe input)
               ntree = 100,                     # Number of random trees (fixed)
               mtry = 4,                        # Number of predictive variables used for each tree (chosen with tuning model)
               replace = FALSE,                 # All different instances
               sampsize = 3000,                 # Training sample size for each tree (chosen with tuning model)
               nodesize = 3                     # Minimum size of terminal nodes (chosen with tuning model)
  )
  pred <- predict(fit, test_features)           # Prediction
  w <- pred > quantile(pred, thresh)            # Top 25% largest values, equally-weighted
  return(w / sum(w))
}

weights_msr <- function (returns){              # Weighting function based on Markowitz theory: maximum Sharpe ratio
  ER <- returns %>% 
    select(Tick, Date, P_return) %>%
    pivot_wider(names_from = Tick, values_from = P_return) %>%
    head(-1) %>%                                # Drop the current date to avoid forward-looking bias
    select(-Date) %>%
    as.matrix()                                 # Matrix of returns
  Cov_mat <- ER %>% cov()                       # Covariance matrix
  Pi <- tangency.portfolio(er = apply(ER, 2, mean), # Allocation function
                                cov.mat = Cov_mat, 
                                risk.free = 0, 
                                shorts = FALSE)
  return(Pi$weights)
} 
weights_multi <- function(past_data, curr_data, train_data, test_data, thresh, j){           
                                                    # Strategies are indexed by j
  if(j == 1){                                       # j = 1 => EW
    return(rep(1/N,N))
  }  

  if(j == 2){                                       # j = 2 => regression-based
    return(weights_reg(past_data, curr_data, thresh))    
  }
  if(j == 3){                                       # j = 3 => XGBoosted tree-based
    return(weights_xgb(train_data, test_data, thresh))
  }
  if(j == 4){                                       # j = 4 => random Forest based
  return(weights_rdf(train_data, test_data, thresh))
  }
}

3 Portfolio backtesting

Backtesting portfolio strategies is usually simple at first, but becomes more intricate when many options are considered. We need to aggregate several arrays and index each frame so that weights and returns are stored in the right place. This process becomes more complicated with the MSR strategy. To compute the weights, we must invert the covariance matrix \(\Sigma\), which is only possible when it is non-singular; this requires more observations (rows) than assets (columns). So we first screen the stocks on a trend signal derived from technical analysis before allocating the weights.
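A minimal sketch (on simulated data, for illustration only) of why this screening step is needed: with more assets than return observations, the sample covariance matrix is rank-deficient, hence singular and non-invertible:

```r
set.seed(1)
T_obs <- 60                                   # e.g., 5 years of monthly returns
N <- 100                                      # more assets than observations
R <- matrix(rnorm(T_obs * N), T_obs, N)       # simulated T_obs x N return matrix
S <- cov(R)                                   # N x N sample covariance matrix
qr(S)$rank                                    # rank is at most T_obs - 1 < N, so S is singular
```

Screening down to a subset of stocks smaller than the number of past dates restores invertibility.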

3.1 Backtesting

t_all <- unique(data_u$Date)                     # Set of dates
tick <- unique(data_u$Tick)                      # Set of assets
sep_date <- as.Date("2016-01-31")                # This date separates in-sample vs out-of-sample
t_oos <- t_all[t_all > sep_date]                 # Out-of-sample dates (i.e., testing set)

Tt <- length(t_oos) - 1 
nb_port <- 4                                                    # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick)))   # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port)           # Store returns

Here we backtest the first 4 strategies with a loop:

strat <- c("EW", "REGRESSION", "XGBOOSTED", "RANDOMFOREST")
train_window <- 365 * 5 + 31
thresh <- 0.75                                           # Threshold on model

returns <- data_u %>%                                    # Take data
  select(Tick, Date, P_return) %>%                       # Select 3 columns
  pivot_wider(names_from = Tick, values_from = P_return) # Put them into 'matrix' format

ret <- data_u %>%
  group_by(Tick) %>%
  mutate(Mov_avg = ma(P_return, order = 5)) 

for(t in 1:Tt){
    realised_returns <- returns %>%                      # Same third step: returns
      filter(Date ==  t_oos[t]) %>% 
      select(-Date)
    
    train_data <- data_u %>% 
      filter(Date < t_oos[t] - 31,                       # One month offset because FR_Return  
              Date >= t_oos[t] - train_window)           # Rolling window 5 years !!!
    
    test_data <- data_u %>% filter(Date == t_oos[t])     # Current values
    N <- test_data$Tick %>% n_distinct()                 # Number of assets
    
    past_data <- data_u %>%                              # The source is data_u!
          filter(Date < t_oos[t]) %>%                    # Past dates: expanding window!
          select(-Date)                                  # Take out dates
    
    curr_data <- data_u %>%                              # The source is data_u!
          filter(Date == t_oos[t]) %>%                   # Current date
          select(-Date)      
    
    for(j in 1:nb_port){                                 # This is the novelty: we loop on the strategies 
        portf_weights[t,j,] <- weights_multi(past_data, curr_data, 
                                             train_data, test_data, 
                                             thresh, j)  
                                                         # The weights' function is indexed by j
        portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns)
      }
  }

We now define a proper process for the MSR strategy and proceed with its backtesting.

strat_all <- c(strat, "MSR")
portf_weights_msr <- array(0, dim = c(Tt, 1, round((N+1)/4)))  # Store weights
portf_returns_msr <- matrix(0, nrow = Tt, ncol = 1)            # Store returns

ret <- data_u %>%
  group_by(Tick) %>%
  mutate(Mov_avg = ma(P_return, order = 5)) %>%         # Moving Average of order 5
  group_by(Date) %>%
  mutate(Mov_avg_75 = quantile(Mov_avg, 0.75, na.rm = T))          

k <- returns %>% dim()                                  # Dimension of dataframe 
  
  for(t in 1:Tt){
  l <- returns %>% 
    filter(Date < t_oos[t]) %>%
    dim()                                               # Dimension of dataframe    
  
  temp_data <- ret %>%                                  # data prior the out of sample date
    filter(Date <= t_oos[t]) %>%
    group_by(Tick) %>%
    filter(Mov_avg[l[1]-2] > Mov_avg_75[l[1]-2]) %>%          
    ungroup()                                           # We keep companies whose moving average
                                                        # is greater than the 75% quantile one month ago.
  
  realised_returns <- temp_data %>%                     # realized returns at each date
      select(Tick, Date, P_return) %>%                  # Select 3 columns
      pivot_wider(names_from = Tick, values_from = P_return) %>%
      filter(Date ==  t_oos[t]) %>%                      # realized returns at date t
      select(-Date)
    
    portf_weights_msr[t,1,] <- weights_msr(temp_data)   # Keep the same weights for the entire holding period
    portf_returns_msr[t,] <- sum(portf_weights_msr[t,1,] * realised_returns)
  }

With all the strategies backtested, we can now assess the performance metrics.

3.2 Performance metrics

To compare all the strategies, we'll use annualized performance measures. We'll compute and focus on the following:

  • Average return: \(\bar{\mu}\) is the simple arithmetic average of a strategy's series of returns over a given period.

  • Volatility: \(\sigma\) is the statistical measure of the dispersion of returns for a given strategy. In most cases, the higher the volatility, the riskier the strategy.

  • Sharpe ratio: the ratio of the average return earned in excess of the risk-free rate per unit of volatility or total risk. In general, the greater the Sharpe ratio, the more attractive the strategy (we're better rewarded for the additional risk taken): \(\frac{\bar{\mu} - r_f}{\sigma}\)

  • Value at risk: a measure of the risk of loss on investments. It estimates the loss that the strategy should not exceed with 95% probability, given normal market conditions, over a set time period: \(VaR_{\alpha} = \inf\{V : P(X \leq V) \geq \alpha\}\)

  • Maximum Drawdown: is the maximum observed loss from a peak to a trough of a portfolio, before a new peak is attained. It’s an indicator of downside risk over a specified time period. \(MDD = \frac{\text{Trough value - Peak value}}{\text{Peak value}}\)

  • Turnover: is a measure of how frequently assets within a fund are bought and sold. It assesses asset rotation and thus impacts transaction costs. Simple turnover computes average absolute variation in portfolio weights: \(\text{Turn}=\frac{1}{T}\sum_{t=2}^T\sum_{n=1}^N|w_t^n-w_{t-1}^n|\). Full turnover takes into account the variation of the weights between rebalancing dates: \(\text{Turn}=\frac{1}{T}\sum_{t=2}^T\sum_{n=1}^N|w_t^n-w_{t-}^n|\), where \(t-\) is the time just before the rebalancing date. We’ll use the full turnover.

  • Transaction cost-adjusted SR: the Sharpe ratio penalized by the turnover, which proxies transaction costs. The higher the turnover and/or the lower the Sharpe ratio, the stronger the penalty: \(TC\text{-}SR=(\bar{\mu}-0.005\times\text{Turn})/\sigma\).

  • Beta: is a measure of the volatility or systematic risk of the portfolio compared to the market as a whole: \(\beta = \frac{cov(\mu,r_m)}{var(r_m)}\)

  • Alpha in CAPM: a measure of the active return on an investment, i.e., its performance compared with a suitable market index (EW in our case): \(\alpha = (\bar{\mu} - r_f)- \beta (\bar{r}_m - r_f)\), with \(r_f\) the risk-free rate and \(\bar{r}_m\) the average return of the benchmark.
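The difference between simple and full turnover can be seen on a toy two-asset example (all numbers are illustrative):

```r
w_prev  <- c(0.5, 0.5)                               # weights set at the previous rebalancing
r       <- c(0.20, -0.10)                            # returns over the holding period
w_minus <- w_prev * (1 + r) / sum(w_prev * (1 + r))  # drifted weights just before rebalancing (t-)
w_new   <- c(0.6, 0.4)                               # target weights chosen at t
sum(abs(w_new - w_prev))                             # simple turnover: 0.2
sum(abs(w_new - w_minus))                            # full turnover: ~0.057, since drift already moved us
```

Here the drift alone carries the portfolio most of the way to the new target, so the full turnover (which we use) is much smaller than the simple one.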

perf_met <- function(portf_returns, p_weights, asset_returns, t_oos){
    Avg_ret <- mean(portf_returns, na.rm = T)*12                # Arithmetic mean 
    Vol <- sd(portf_returns, na.rm = T)*sqrt(12)                # Volatility
    Sharpe_ratio <- Avg_ret / Vol                               # Sharpe ratio
    VaR_5 <- quantile(portf_returns, 0.05)*sqrt(12)             # Value-at-risk
    maxDD <- maxDrawdown(portf_returns)                         # maximum Drawdown
    Turnover <- 0                                               # Initialization of turnover
    for(t in 2:(length(t_oos)-1)){
        realized_returns <- asset_returns %>% filter(Date == t_oos[t]) %>% select(-Date)
        prior_weights <- p_weights[t-1,] * (1 + realized_returns)
        if(sum(prior_weights)>0){
            Turnover <- Turnover + sum(abs(p_weights[t,] - prior_weights/sum(prior_weights)))
        }
    }
    Turnover <- 12*Turnover/(Tt - 1)                            # Average over time
    Trans_cost_SR <- (Avg_ret - 0.005*Turnover) / Vol           # Transaction cost-adjusted SR
    met <- data.frame(Avg_ret, Vol, Sharpe_ratio, VaR_5, 
                      maxDD, Turnover, Trans_cost_SR)           # Aggregation of all of this
    return(met)
}

perf_met_multi <- function(portf_returns, weights, asset_returns, t_oos, strat_name){
    J <- dim(weights)[2]                                        # Extract dimension of weights array 
    met <- c()                                                  # Initialize metrics dataframe
    for(j in 1:J){                                              # Compute the performance of each strategy
        temp_met <- perf_met(portf_returns[, j], weights[, j, ], asset_returns, t_oos)
        met <- rbind(met, temp_met)
    }
    row.names(met) <- strat_name                                # Rename of each row
    return(met)
}

returns_msr <- temp_data %>% select(Tick, Date, P_return) %>%   # Returns matrix of MSR strategy
  pivot_wider(names_from = Tick, values_from = P_return)
  
metrics <- rbind(perf_met_multi(portf_returns, portf_weights, returns, t_oos, strat),
       perf_met_multi(portf_returns_msr, portf_weights_msr, returns_msr, t_oos, "MSR"))
                                                                # Apply metrics function to all strategies
metrics %>% round(4)

The equally-weighted portfolio serves as a benchmark and its metrics as a reference. With the lowest turnover, it is the least penalized by transaction costs: its Sharpe ratio only embeds the small fees needed to restore the same weight on each asset every month.
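Even the EW portfolio trades every month, because weights drift with realized returns between rebalancing dates. A small sketch with made-up returns shows the monthly turnover needed to restore equal weights:

```r
w0 <- rep(1/4, 4)                                  # equal weights over 4 assets
r  <- c(0.10, -0.05, 0.02, 0.00)                   # one month of illustrative returns
w_drift <- w0 * (1 + r) / sum(w0 * (1 + r))        # weights just before rebalancing
sum(abs(w0 - w_drift))                             # turnover to rebalance back to 1/4 each
```

The resulting turnover is small (around 4% here) but strictly positive, which is why even the EW strategy carries some fees.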

Overall, the performance results of the machine learning models are not impressive; they're even disappointing. The machine-learning-based strategies, which outperformed the benchmark on the wide dataset (which consists of a core of 30 of the largest US stocks), can't rise above the benchmark here. Their Sharpe ratios are indeed lower, since they have a lower average return and higher volatility. Since we're trying to predict the top 25%, we face huge turnovers (10.29 and 14.72) and large transaction fees. One possible explanation is that we are not looking at the right metric and may be facing a false negative. Another is that 4 features aren't enough to explain the behavior of 881 companies around the world, each operating under specific economic conditions (tax reduction, inflation, GDP, unemployment rate, …).

The MSR strategy ends with the highest Sharpe ratio (1.10) and an average return lower than the benchmark's (14.88% compared to 17.98%). This result is aligned with Markowitz theory, because the allocation process penalizes volatility at the expense of return; but in this case that is not enough to generate additional profit. Its turnover of 20.88 is the highest among all the portfolios.

The last strategy, based on a simple Ordinary Least Squares (OLS) regression, finishes in first place. With the same features as the machine learning processes, it appears to be the best predictor. With an average return of 24.05% and, of course, a higher volatility than the benchmark (28.36% compared to 20.27%), its Sharpe ratio is more rewarding than the other strategies' and comes with a lower turnover. However, it's the portfolio that experiences the largest loss from a peak (highest maximum drawdown).

CAPM <- function(benchmarks, returns, Rf, strat){                       # Compute alpha and beta of all the strategies
  J <- dim(returns)[2]
  met <- c()
  for (j in 1:J){
    beta <- cov(returns[,j], benchmarks)                                # Covariance between strategy and benchmark EW
    beta <- beta / var(benchmarks)
    alpha <- mean(returns[,j]) - Rf - beta*(mean(benchmarks)-Rf)        # Additional gain compared to the benchmark
    temp_met <- data.frame(beta, alpha)
    met <- rbind(met, temp_met)
  }
  rownames(met) <- strat
  return(met)
}
Assetmodel <- rbind(CAPM(portf_returns[,1], portf_returns, 0, strat),
                    CAPM(portf_returns[,1], portf_returns_msr, 0, "MSR"))
Assetmodel %>% round(4)

The alpha is the additional return sought by active portfolio managers and the beta is the systematic risk embedded in each strategy. We can see that all the strategies except the MSR are riskier than the benchmark and don't provide additional return. Since these values are averages over the trading period, we need to look at the cumulative return to better appreciate the trend of the last 5 years. The MSR strategy has the lowest volatility and hence the lowest beta, \(\beta_i = \rho_{im}\frac{\sigma_i}{\sigma_m}\). At first sight, we could select it as the least risky strategy; however, we need to consider the screening applied at each date (keeping the top quartile according to the monthly moving average), which creates huge transaction fees (sell the asset if its moving average falls below the threshold; buy/hold but change the weight if it is above). Is the alpha of \(5.8‰\) enough to cover all the fees? With a performance below the EW, we're not sure.
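The two expressions for beta, \(\mathrm{cov}(\mu, r_m)/\mathrm{var}(r_m)\) and \(\rho_{im}\sigma_i/\sigma_m\), are algebraically identical; a quick sanity check on simulated returns (illustrative numbers only, not our backtest data):

```r
set.seed(7)
rm_ret <- rnorm(60, mean = 0.01, sd = 0.04)                 # simulated market returns
ri_ret <- 0.5 * rm_ret + rnorm(60, sd = 0.02)               # simulated strategy returns
beta_cov <- cov(ri_ret, rm_ret) / var(rm_ret)               # beta via covariance
beta_rho <- cor(ri_ret, rm_ret) * sd(ri_ret) / sd(rm_ret)   # beta via correlation decomposition
all.equal(beta_cov, beta_rho)                               # TRUE: same quantity
```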

3.3 Deflated Sharpe ratio

The major risk of a portfolio allocation model is a false positive. To avoid it, we should carefully choose the strategy to be implemented on the trading platform. In order to correct for in-sample results that seem too optimistic, it is customary to shrink Sharpe ratios by some factor, usually close to 0.5.
Bailey and Lopez de Prado (2014) propose an approach to evaluate the statistical significance of Sharpe ratios when \(M\) strategies have been tested. The following test procedure assesses the relevance of the strategy with the highest Sharpe ratio, the MSR: \[t = \phi\left(\frac{(SR-SR^*)\sqrt{T-1}}{\sqrt{1-\gamma_3SR+\frac{\gamma_4-1}{4}SR^2}} \right),\]

where \(\gamma_3\) and \(\gamma_4\) are the skewness and kurtosis of the returns of the strategy and \(SR^*\) is the theoretical average value of the maximum SR over \(M\) trials:

\[SR^*=\mathbb{E}[SR_m]+\sqrt{\mathbb{V}[SR_m]}\left((1-\gamma)\phi^{-1}\left(1-\frac{1}{M}\right)+\gamma \phi^{-1}\left(1-\frac{1}{Me}\right) \right),\]

where \(\phi\) is the cumulative distribution function (cdf) of the standard Gaussian law and \(\gamma\) is the Euler-Mascheroni constant. The index \(m\) runs over the strategy trials and \(M\) is the total number of tested strategies. If the \(t\) defined above is below a certain threshold (e.g., 0.95), then the \(SR\) cannot be deemed significant.

DSR <- function(SR, Tt, M, g3, g4, SR_m, SR_v){ # First, we build the function
  gamma <- -digamma(1)                          # Euler-Mascheroni constant
  SR_star <- SR_m + sqrt(SR_v)*((1-gamma)*qnorm(1-1/M) + gamma*qnorm(1-1/M/exp(1))) # Avg. Max SR
  num <- (SR-SR_star) * sqrt(Tt-1)              # Numerator
  den <- sqrt(1 - g3*SR + (g4-1)/4*SR^2)        # Denominator
  return(pnorm(num/den))
}
# Next, the parameters:
M <- nrow(metrics)              # Number of strategies we tested
SR <- max(metrics$Sharpe_ratio) # The SR we want to test
SR_m <- mean(metrics$Sharpe_ratio)        # Average SR across all strategies
SR_v <- var(metrics$Sharpe_ratio)         # Variance of SR across all strategies

g3 <- skewness(portf_returns_msr)         # Function from the e1071 package
g4 <- kurtosis(portf_returns_msr) + 3     # Function from the e1071 package
DSR(SR, Tt, M, g3, g4, SR_m, SR_v)        # The sought value!
## [1] 0.576204

Although this value is greater than 0.5 (\(\simeq 0.58\)), it is still far from the 0.95 confidence threshold. The SR of our best strategy (best only in terms of Sharpe ratio) is not statistically significant, because it does not sit far enough to the right of the distribution of all SRs.

3.4 Plot

plot_data <- (portf_returns+1) %>% apply(2,cumprod)             # Values
plot_data <- rbind(1, plot_data)                                # Initiate at one

plot_data_msr <- (portf_returns_msr+1) %>% apply(2, cumprod)    # Values
plot_data_msr <- rbind(1, plot_data_msr)                        # Initiate at one

plot_data <- cbind(plot_data, plot_data_msr)                    # Merge two dataframe
plot_data <- data.frame(t_oos, plot_data)                       # Adding the date
colnames(plot_data) <- c("Date", strat_all)                     # Changing column names
graph <- plot_data %>%                                          # Final formatting & plot
  pivot_longer(-Date, names_to = "strategy", values_to = "value") %>%
  ggplot(aes(x = Date, y = value, color = strategy)) + 
  geom_line() + theme_light() + geom_point() + labs(y = "Cumulative return") +
  ggtitle("Strategies returns") + 
  scale_colour_brewer(palette = "Set1") 

ggplotly(graph)                                                 # Ggplotly to manipulate the graph

The cumulative return plot illustrates the comments above. The machine learning allocations end with cumulative returns below 1.8, without ever surpassing the benchmark, which finishes at 2.18 over the trading period. The MSR crossed the EW barrier in September 2019 but declined right after. It suffered a smaller hit during the Covid crisis, which allowed it to stand above the EW and the regression strategies from March 2020 to November 2020 before being overtaken by both.




Conclusion: The aim of the project was to define machine learning strategies and apply the course modules to a portfolio allocation and backtesting project. As we have seen, we need to choose our hyperparameters carefully to maximize the outcome of the fitting process, without learning too fast on the training sample, overfitting the data and losing prediction power. The results may seem disappointing, but they are realistic! It takes experience and experimentation to make ML algorithms deliver good results. To achieve our goal, we would need to handle many more impactful features (more than 4, of course):

  • internal: directly related to the company: Market Capitalization, Price-to-Book, Debt-to-Equity, Profitability Margin, ESG score, Fama-French factors, …
  • external: related to the region where the company is located and to the overall economy: macroeconomic factors and the behavior of related financial markets, such as the commodity patterns involved in the company's production chain.

Assuming we have built a good model, tested on various out-of-sample datasets, we still need to be cautious about recession periods and crises. In March 2020, the world experienced the Covid-19 crisis, and some companies are still struggling to bounce back 18 months later. Whatever the strategy implemented, as portfolio managers we need to be prepared to handle such situations with well-prepared contingency risk management: long/short vanilla and exotic options, swaps, the bond market, …